Correspondence-guided Synchronous Parsing of Parallel Corpora
Abstract
We present an efficient dynamic programming algorithm for synchronous parsing of sentence pairs from a parallel corpus with a given word alignment. Unless a large proportion of words lack a correspondence in the other language, the worst-case complexity is significantly lower than for standard synchronous parsing. The theoretical complexity results are corroborated by a quantitative experimental evaluation.

Our longer-term goal is to induce monolingual grammars from a parallel corpus, exploiting implicit information about syntactic structure obtained from correspondence patterns.[Note 1] Here we provide an important prerequisite for parallel corpus-based grammar induction: an efficient algorithm for synchronous parsing, given a particular word alignment (e.g., the most likely option from a statistical alignment).

Synchronous grammars. We assume a straightforward extension of context-free grammars (compare the transduction grammars of [Lewis II and Stearns, 1968]): (1) the terminal and non-terminal categories are pairs of symbols (or NIL); (2) the sequence of daughters can differ for the two languages; we use a compact rule notation with a numerical ranking for the linear precedence in each language. The general form of a rule is

$$N_0/M_0 \rightarrow N_1{:}i_1/M_1{:}j_1 \;\ldots\; N_k{:}i_k/M_k{:}j_k,$$

where $N_l$, $M_l$ are NIL or a (non-)terminal symbol for language L1 and L2, respectively, and $i_l$, $j_l$ are natural numbers for the rank in the sequence for L1 and L2 (for NIL categories a special rank 0 is assumed). Compare fig. 1 for a sample analysis of the German/English sentence pair Wir müssen deshalb die Agrarpolitik prüfen/So we must look at the agricultural policy. We assume a normal form in which the right-hand side is ordered by the rank in L1.[Note 2] The formalism goes along with the continuity assumption that every complete constituent is continuous in both languages.[Note 3]

Synchronous parsing.
Our dynamic programming algorithm can be viewed as a variant of Earley parsing and generation, which in turn can be described by inference rules. For instance, the central completion step in Earley parsing can be described by the rule[Note 4]

$$\frac{\langle X \rightarrow \alpha \bullet Y \beta,\ [i, j]\rangle,\quad \langle Y \rightarrow \gamma\, \bullet,\ [j, k]\rangle}{\langle X \rightarrow \alpha\, Y \bullet \beta,\ [i, k]\rangle} \tag{1}$$

The input in synchronous parsing is not a one-dimensional string, but a pair of sentences, i.e., a two-dimensional array of possible word pairs (or a multidimensional array if we are looking at a multilingual corpus). The natural way of generalizing context-free parsing to synchronous grammars is thus to use string indices in both dimensions. So we get inference rules like the following (there is another one in which the $i_2/j_2$ and $j_2/k_2$ indices are swapped between the two items above the line):

$$\frac{\langle X_1/X_2 \rightarrow \alpha \bullet Y_1{:}r_1/Y_2{:}r_2\ \beta,\ [i_1, j_1, j_2, k_2]\rangle,\quad \langle Y_1/Y_2 \rightarrow \gamma\, \bullet,\ [j_1, k_1, i_2, j_2]\rangle}{\langle X_1/X_2 \rightarrow \alpha\ Y_1{:}r_1/Y_2{:}r_2 \bullet \beta,\ [i_1, k_1, i_2, k_2]\rangle} \tag{2}$$

Since each inference rule contains six free variables over string positions ($i_1, j_1, k_1, i_2, j_2, k_2$), we get a parsing complexity of order $O(n^6)$ for unlexicalized grammars (where $n$ is the number of words in the longer of the two strings from L1 and L2) [Wu, 1997; Melamed, 2003].

[Note 1] Cf. the new PTOLEMAIOS project at Saarland University (http://www.coli.uni-saarland.de/~jonask/PTOLEMAIOS/).
[Note 2] However, categories that are NIL in L1 come last. If there are several, they are viewed as unordered with respect to each other.
[Note 3] As [Melamed, 2003] discusses, such an assumption is empirically problematic with binary grammars. However, if flat analyses are assumed for clauses and NPs, the typical problematic cases are resolved.

Correspondence-guided parsing. As an alternative to standard "rectangular indexing" we propose an asymmetric approach: one of the languages (L1) provides the "primary index", namely the string span in L1, as in monolingual parsing.
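To make the cost of rectangular indexing concrete, the span bookkeeping of rule (2) and its mirrored variant can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the names `complete_rectangular` and `count_instantiations` are hypothetical, grammar symbols are left out, and an item is reduced to a pair of spans (one per language).

```python
from itertools import combinations_with_replacement
from typing import Optional, Tuple

Span = Tuple[int, int]          # (start, end) string positions
Rect = Tuple[Span, Span]        # (span in L1, span in L2)


def complete_rectangular(active: Rect, passive: Rect) -> Optional[Rect]:
    """Combine two rectangular indices as in rule (2) or its mirrored variant.

    Returns None when the two items are not adjacent in both languages.
    """
    (a1, a2), (p1, p2) = active, passive
    # L1: the passive item must start where the active item ends.
    if a1[1] != p1[0]:
        return None
    span1 = (a1[0], p1[1])
    # L2: the passive item may attach on either side of the active item.
    if p2[1] == a2[0]:          # rule (2): passive precedes active in L2
        return span1, (p2[0], a2[1])
    if a2[1] == p2[0]:          # mirrored variant: passive follows active in L2
        return span1, (a2[0], p2[1])
    return None


def count_instantiations(n: int) -> int:
    """Count index choices i1 <= j1 <= k1 and i2 <= j2 <= k2 over positions 0..n."""
    triples = sum(1 for _ in combinations_with_replacement(range(n + 1), 3))
    return triples * triples
```

Because three positions are chosen independently in each language, the count returned by `count_instantiations` grows on the order of $n^6$, which is the source of the complexity bound stated for rule (2).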
As a secondary index, L2 contributes a chart-generation-style bit vector of the words covered, which is mainly used to guide parsing, i.e., certain options are eliminated. A complete sample index for müssen/must in fig. 1 would be $\langle [1, 2], [00100000] \rangle$. Completion can be formulated as inference rule (3).[Note 5] Condition (iii) excludes discontinuity in passive chart items, i.e., complete constituents; active items (i.e., partial constituents) may well contain discontinuities.

$$\frac{\langle X_1/X_2 \rightarrow \alpha \bullet Y_1{:}r_1/Y_2{:}r_2\ \beta,\ \langle [i, j], \mathbf{v} \rangle\rangle,\quad \langle Y_1/Y_2 \rightarrow \gamma\, \bullet,\ \langle [j, k], \mathbf{w} \rangle\rangle}{\langle X_1/X_2 \rightarrow \alpha\ Y_1{:}r_1/Y_2{:}r_2 \bullet \beta,\ \langle [i, k], \mathbf{u} \rangle\rangle} \tag{3}$$

where (i) $j \neq k$; (ii) $\mathrm{OR}(\mathbf{v}, \mathbf{w}) = \mathbf{u}$; (iii) $\mathbf{w}$ is continuous (i.e., it contains at most one contiguous run of 1's).

[Note 4] A chart item is specified through a position ($\bullet$) in a production and a string span ($[l_1, l_2]$). $\langle X \rightarrow \alpha \bullet Y \beta, [i, j] \rangle$ is an active item recording that between positions $i$ and $j$, an incomplete X phrase has been found, which covers $\alpha$ but still misses $Y \beta$. Items with a final $\bullet$ are called passive.
[Note 5] We use the bold-faced variables $\mathbf{v}, \mathbf{w}, \mathbf{u}$ for bit vectors; OR performs bitwise disjunction on the vectors.
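The three side conditions of rule (3) can be checked with ordinary integer bit operations. The following Python sketch is an illustration under stated assumptions rather than the authors' code: the names `is_continuous` and `complete` are hypothetical, grammar symbols are omitted, and the L2 bit vector is stored as an integer parsed from the string notation used in the text. The Wir/we index in the usage note is assumed purely for illustration.

```python
from typing import Optional, Tuple

# An index is (L1 span, L2 coverage), e.g. ((1, 2), int("00100000", 2)).
Index = Tuple[Tuple[int, int], int]


def is_continuous(mask: int) -> bool:
    """True if the set bits of mask form at most one contiguous run."""
    if mask == 0:
        return True
    lowest = mask & -mask                   # isolate the lowest set bit
    # Adding the lowest bit carries through one run of 1s and clears it;
    # any set bit that survives belongs to a second, separate run.
    return (mask + lowest) & mask == 0


def complete(active: Index, passive: Index) -> Optional[Index]:
    """Apply the index part of rule (3); return None if a condition fails."""
    (i, j), v = active
    (start, k), w = passive
    if start != j:                          # items must be adjacent in L1
        return None
    if j == k:                              # condition (i): j != k
        return None
    if not is_continuous(w):                # condition (iii): passive item is continuous
        return None
    return (i, k), v | w                    # condition (ii): u = OR(v, w)
```

For example, combining an assumed active index `((0, 1), int("01000000", 2))` for Wir/we with the passive index `((1, 2), int("00100000", 2))` for müssen/must yields the L1 span `(0, 2)` and the L2 coverage `01100000`. The continuity test is applied only to the passive item, which matches the remark that active items may contain discontinuities.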